Abstract: We introduce the notion of relevant information loss
for the purpose of casting the signal enhancement problem in
information-theoretic terms. We show that many algorithms from
machine learning can be reformulated using relevant information
loss, which allows their application to the aforementioned
problem. As a particular example, we analyze principal component
analysis for dimensionality reduction, discuss its optimality, and
show that the relevant information loss can indeed vanish if
the relevant information is concentrated on a lower-dimensional
subspace of the input space.

I. Introduction 

It is a widely known fact from information theory that
processing a random variable, or a signal, cannot increase
the amount of information it represents. In fact, as the data
processing inequality states, information can only be lost
by passing a signal through a deterministic system. This
information loss can, very loosely speaking, be interpreted
as the difference between the information at the input and the
information at the output of the system. In the case of
continuous-valued random variables (which contain an infinite
amount of information), the authors developed a theory
quantifying the information loss in both absolute and relative
terms [1], [2].

We now look at the data processing inequality (DPI) from 
a different point of view by asking the following question: 
How can we justify signal processing, knowing that it can 
only reduce information? Whenever technical systems prepare 
a physical, information-carrying signal for human perception, 
processing occurs and, presumably, a significant amount of 
information is lost. The only justification for this approach 
is that the information which is lost is actually not the 
information we are interested in, but rather some nuisance. 
We hope to preserve all the information relevant to us while 
minimizing the nuisance. The problem of signal enhancement, 
as we will argue later, is an optimization problem which can 
be cast in exactly these terms. Aside from that, each block in
the signal processing chain has the goal of representing the
relevant information such that as little as possible is lost in
the subsequent blocks; in the sense of [3], preprocessing
can improve performance, while postprocessing cannot: we
cannot recover lost information, but we can prevent losing it.
It is our goal to make these statements precise by supporting
them with a mathematical framework of relevant information
loss.

Relevant information and its counterpart, relevant information
loss, are concepts not altogether new. Indeed, the
notion of relevance has a long history in machine learning
and neural networks: Plumbley, essentially using the same
definition as we do, explicitly used relevant information loss in
analyzing the properties of principal component analysis [4],
[5]. The information bottleneck (IB) method and its variants
directly maximize relevant information while minimizing the
complexity of its signal representation [6]. Relevance w.r.t. a
specific goal was the motivation for defining an information
processing measure for neural networks in [7], [8]. Finally, the
principle of relevant information, being structurally similar to
the information bottleneck formulation, builds on minimizing
the relative entropy between the relevant random variable and
its representation [9].

Conversely, in signal processing the notion of relevant
information has not yet gained a foothold. In this paper we thus
formulate a definition of relevant information loss (Section II),
following the ideas of [4], and discuss its properties from
the viewpoint of system theory (Section III). On the basis
of these properties we then discuss the problem of signal
enhancement in Section IV and investigate the connections
to the machine learning literature. As a special example of
signal enhancement we analyze principal component analysis
(Section V), generalizing the results of Plumbley [4]. Finally, we give
a brief example of the application of our results to a simple
digital communication system (Section VI).

II. A Definition of Relevant Information Loss 

We start by recalling the definition given in [1], where
the information loss induced by transforming a (possibly
multidimensional) random variable (RV) X to another RV Y
by a static function $g: \mathcal{X} \to \mathcal{Y}$, $\mathcal{X}, \mathcal{Y} \subseteq \mathbb{R}^N$, was given as

$$L(X \to Y) = \sup_{\mathcal{P}} \left( I(\hat{X}; X) - I(\hat{X}; Y) \right) = H(X|Y) \tag{1}$$

where the supremum is over all partitions $\mathcal{P}$ of the sample
space $\mathcal{X}$ of X, and where $\hat{X}$ is obtained by quantizing
X accordingly (see Fig. 1(a)). This definition of absolute
information loss is accompanied by an expression for the
relative information loss in [2].

Fig. 1. Model for computing the information loss of a memoryless input-output system g. Q is a quantizer with partition $\mathcal{P}$. Panel (b): computing the relevant information loss $L_S(X \to Y)$.

As the examples in [2] show, both the absolute and the
relative definition of information loss have their shortcomings,
especially when it comes to systems g used for signal enhancement:
Since the expressions only consider the RV X at the
input of the system, they do not take into account that not all
of the information contained in X is relevant, often leading to
counter-intuitive results. The main contribution of this work
thus lies in analyzing the implications of the following

Definition 1 (Relevant Information Loss). Let X be an RV on 
the sample space X, and let Y be obtained by transforming X 
with a static function g. Let S be another RV on the sample 
space S representing relevant information. The information 
loss relevant w.r.t. S is defined as 



$$L_S(X \to Y) = \sup_{\mathcal{P}} \left( I(\hat{S}; X) - I(\hat{S}; Y) \right) = I(X; S|Y) \tag{2}$$

where the supremum is over all partitions $\mathcal{P}$ of the sample space
$\mathcal{S}$, and where $\hat{S}$ is obtained by quantizing S accordingly.

The information loss relevant w.r.t. an RV S is thus the 
difference of mutual informations between S and the input 
and output of the system (see Fig. 1(b)). Equivalently, the
relevant information loss is exactly the information that X
contains about S which is not contained in Y. Due to the data
processing inequality (e.g., [10]) it is a non-negative quantity,
i.e., a deterministic system cannot increase the amount of 
available relevant information. 
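
For discrete RVs, where all mutual informations are finite and the supremum in (2) can be omitted, Definition 1 can be evaluated directly. The following Python sketch illustrates this under stated assumptions: the toy distribution (a fair bit S observed together with an independent nuisance bit B) and the helper names `mutual_information` and `relevant_information_loss` are purely illustrative and do not appear in the original development.

```python
from collections import defaultdict
from math import log2

def mutual_information(joint):
    """I(A;B) in bits from a joint pmf given as {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def relevant_information_loss(p_sx, g):
    """L_S(X -> Y) = I(S;X) - I(S;Y) for Y = g(X), with p_sx = {(s, x): prob}."""
    p_sy = defaultdict(float)
    for (s, x), p in p_sx.items():
        p_sy[(s, g(x))] += p          # push the pmf through the system g
    return mutual_information(p_sx) - mutual_information(p_sy)

# Hypothetical toy model: S is a fair bit and X = (S, B), where B is an
# independent fair nuisance bit.
p_sx = {(s, (s, b)): 0.25 for s in (0, 1) for b in (0, 1)}

loss_keep_s = relevant_information_loss(p_sx, lambda x: x[0])  # drops only B
loss_keep_b = relevant_information_loss(p_sx, lambda x: x[1])  # drops S
```

In this toy model, the system that discards only the nuisance bit loses no relevant information, whereas the system that discards S loses the full bit, although both systems destroy one bit of information in absolute terms.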

As we have already pointed out in the introduction, this
definition of relevant information loss is not altogether new:
Plumbley already introduced this quantity (named $\Delta I_S(X; Y)$,
and omitting the supremum by assuming finite mutual information
between S and X) in the context of unsupervised
learning in neural networks [4]. The rationale for this new
measure was to circumvent some shortcomings of Linsker's
principle of information maximization [11]: While infomax
works well for Gaussian RVs and linear systems, applying
the same algorithms to non-Gaussian data just maximizes an
upper bound on the information. Plumbley's information loss,
conversely, also yields closed-form solutions for Gaussian data,
but in addition to that one can derive upper bounds on the
relevant information loss whenever the data is non-Gaussian.
Thus, minimizing an upper bound on the information loss can
be assumed to be more promising than maximizing an upper
bound on the transferred information [4].



III. Properties of Relevant Information Loss 

We now analyze the elementary properties of relevant 
information loss: 

Proposition 1 (Elementary Properties). The relevant information
loss from Definition 1 satisfies the following properties:

1) $L_S(X \to Y) \le I(S; X) \le H(S)$
2) $L_S(X \to Y) \ge L_Y(X \to Y) = 0$
3) $L_S(X \to Y) \le L_X(X \to Y) = L(X \to Y)$, with
equality if X is a function of S.
4) $L_S(X \to Y) = H(S|Y)$ if S is a function of X.

Proof: The first property results immediately from the
definition, while the second property is due to the fact that Y
is a function of X. The third property results from making
X the relevant RV, thus making (2) equal to (1). Note that,
with (2),

$$L_S(X \to Y) = I(S; X|Y) = H(X|Y) - H(X|Y, S) \tag{3}$$
$$\le L(X \to Y) \tag{4}$$

with equality if X is a function of S. The last property, for
S = f(X), follows by expanding I(X; S|Y) as H(S|Y) -
H(S|Y, X) = H(S|Y). ■

The third property is of particular interest. Essentially, it
states that the relevant information loss cannot exceed the total
information loss, so upper bounds (e.g., those presented in [1])
for the latter can be used for the former as well. In addition
to that, it suggests an alternative way to define $L(X \to Y)$,
namely as

$$L(X \to Y) = \sup_{S} \left( I(S; X) - I(S; Y) \right) \tag{5}$$

where the supremum is over all RVs on the sample space $\mathcal{X}$.

Fig. 2. Cascade of two systems: The relevant information loss of the cascade
is the sum of relevant information losses of the constituent systems.

Fig. 3. The problem of signal enhancement: Given a (relevant) information
signal S and its observation X, how shall we design a system such that its
output, Y, represents S as well as possible?

Since by Definition 1 relevant information loss is represented
by a conditional mutual information, it inherits all of
its properties. In particular, as we show next, a data processing
inequality (DPI) holds:

Proposition 2 (Data Processing Inequality). Let V - W - X - Y
be a Markov chain. Then,

$$L_W(X \to Y) \ge L_V(X \to Y). \tag{6}$$

Proof: See Appendix. ■
The particular usefulness of the DPI (especially in the
context of our work) relies on the fact that both S - X - g(X)
and f(S) - S - X are Markov chains. Comparing this to
Proposition 2, one is tempted to believe that the direction of
the inequality depends on whether f(S) - S - X - Y or
S - f(S) - X - Y is a Markov chain. The following corollary
resolves this complication.

Corollary 1. Let f be a measurable function defined on the 
sample space of S. Then, 

$$L_S(X \to Y) \ge L_{f(S)}(X \to Y) \tag{7}$$

with equality if S — f(S) — X — Y is a Markov chain. 

Proof: See Appendix. ■ 
Before proceeding, we note that the third property of
Proposition 1 is an immediate consequence of this corollary.
We next show that, inherited from the properties of mutual
information, the relevant information loss also obeys a chain
rule:

Proposition 3 (Chain Rule of Information Loss). The information
loss $L_{S_1^n}(X \to Y)$ w.r.t. a collection $S_1^n = \{S_1, \dots, S_n\}$
of relevant RVs induced by a function g satisfies

$$L_{S_1^n}(X \to Y) = \sum_{i=1}^{n} L_{S_i|S_1^{i-1}}(X \to Y). \tag{8}$$

The proof follows immediately from the chain rule of 
(conditional) information and is thus omitted. However, this 
chain rule justifies our intuitive understanding of the nature of 
information loss. We thus emphasize 

Corollary 2. The information loss $L(X \to Y)$ induced by a
function g can be split into relevant (w.r.t. S) and irrelevant
information loss:

$$L(X \to Y) = L_S(X \to Y) + L_{X|S}(X \to Y) \tag{9}$$

Proof: The proof follows from Proposition 3 and from the
fact that, since S and Y are conditionally independent given
X, we have $L_{X,S}(X \to Y) = L(X \to Y)$. ■

Note that even in a simple scenario with additive noise, i.e.,
X = S + N, it is not straightforward to just identify the noise
with the irrelevant information. We will show this in one
of the examples in Section VI.

A final property we want to present concerns the cascade of
systems (see Fig. 2). Assume, for the moment, that the output
Y of the first system g is used as the input to a second system
h, which responds with the RV Z. In [12] it was shown that
the information loss of this cascade is additive, i.e.,

$$L(X \to Z) = L(X \to Y) + L(Y \to Z). \tag{10}$$

The same holds for the relevant information loss, as Plumbley
showed in [4]. We thus state

Proposition 4 (Information Loss in a Cascade is Additive, [4]).
Let Y be the RV obtained by transforming X with a static
function g, and let Z be obtained by transforming Y with h.
Then, the information loss relevant w.r.t. S is given as

$$L_S(X \to Z) = L_S(X \to Y) + L_S(Y \to Z). \tag{11}$$

Proof: See Appendix. ■ 
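
Both the split of Corollary 2 and the additivity of Proposition 4 can be checked numerically for discrete RVs. The sketch below does so for a hypothetical joint pmf of (S, X) and two hypothetical merging functions g and h; all numbers and helper names are illustrative assumptions, and the entropies are evaluated directly from the definitions rather than by any method taken from [4] or [12].

```python
from collections import defaultdict
from math import log2

def entropy(pmf):
    """Shannon entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

def marginal(joint, keep):
    """Marginal pmf of the tuple coordinates listed in `keep`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in keep)] += p
    return out

def cond_entropy(joint, a, c):
    """H(A|C) = H(A,C) - H(C); a and c are lists of coordinate indices."""
    return entropy(marginal(joint, a + c)) - entropy(marginal(joint, c))

def cond_mi(joint, a, b, c):
    """I(A;B|C) = H(A|C) - H(A|B,C)."""
    return cond_entropy(joint, a, c) - cond_entropy(joint, a, b + c)

# Hypothetical joint pmf of (S, X): S binary, X in {0, 1, 2, 3}.
p_sx = {(0, 0): 0.30, (0, 1): 0.10, (0, 2): 0.05, (0, 3): 0.05,
        (1, 0): 0.05, (1, 1): 0.10, (1, 2): 0.15, (1, 3): 0.20}
g = lambda x: min(x, 2)   # first system: merges x = 2 and x = 3
h = lambda y: min(y, 1)   # second system: merges y = 1 and y = 2

joint = defaultdict(float)                    # pmf over (s, x, y, z)
for (s, x), p in p_sx.items():
    joint[(s, x, g(x), h(g(x)))] += p

S, X, Y, Z = [0], [1], [2], [3]
total_loss = cond_entropy(joint, X, Y)        # L(X -> Y) = H(X|Y)
relevant = cond_mi(joint, S, X, Y)            # L_S(X -> Y) = I(S;X|Y)
irrelevant = cond_entropy(joint, X, Y + S)    # L_{X|S}(X -> Y) = H(X|Y,S)
cascade = cond_mi(joint, S, X, Z)             # L_S(X -> Z)
stage_2 = cond_mi(joint, S, Y, Z)             # L_S(Y -> Z)
```

Up to floating-point error, `total_loss` equals `relevant + irrelevant` as in (9), and `cascade` equals `relevant + stage_2` as in (11).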

IV. Signal Enhancement, Relevant Information Loss, and IB with Side Information

We now turn our attention to the signal enhancement 
problem. Often, the information one wants to retrieve is not 
directly available, but only through some corrupted observa- 
tion: Information-carrying signals are superimposed by noise, 
distorted through nonlinear systems, and affected by time- 
varying, dynamic effects. It is essentially the goal of the signal 
processing engineer to mitigate all these adverse effects; to 
improve the quality of the observation such that as much 
information as possible can be retrieved from it with little 
effort. 

Looking at this task from the perspective of information 
theory, we know that, as the data processing inequality dic- 
tates, signal enhancement does not mean that one increases 
the amount of information in the observation. At best, one 
can build a system which preserves as much information as 
possible (see Fig. 3). Conversely, noise, distortion, and other
irrelevant components of the observation should be removed 
such that information retrieval can be done easily. 

This is where our notion of relevant information loss comes 
in: If we let S be the information carrying signal and X 
its corrupted observation, our goal is to find a function g 
such that the relevant information loss is minimized while 
simultaneously maximizing the irrelevant information loss. 
One may cast the resulting optimization problem as follows: 

$$\max_g \; L_{X|S}(X \to Y) \quad \text{s.t.} \quad L_S(X \to Y) \le C \tag{12}$$

where C is some constant.¹ In other words, signal enhancement
is the maximization of irrelevant information loss under a
constraint on the relevant information loss.

Although in signal processing these information-theoretic 
considerations are only slowly gaining momentum, they are 
quite widespread in machine learning and neural networks, 
e.g., 0, 0. In particular, a variational formulation of max- 
imizing relevant information is the basis of the information 
bottleneck method (IB) [6]:

$$\min_{p(y|x)} I(Y; X) - \beta I(S; Y) \tag{13}$$

where the minimization is performed over all relations between
the (discrete) RVs Y and X and where $\beta$ is a design parameter
trading compression for preservation of relevant information,
I(S; Y). While in principle the relation p(y|x) can be stochastic,
in many cases the algorithm is used for (hard) clustering
using a deterministic function. We now try to express (13)
in terms of relevant and irrelevant information loss: Using
$L_S(X \to Y) = I(S; X) - I(S; Y)$ we obtain

$$I(S; Y) = I(S; X) - L_S(X \to Y) \tag{14}$$

where the first term is obviously independent of p(y|x).
Restricting ourselves to the clustering problem, i.e., to
deterministic functions Y = g(X), and letting $g^\circ$ be the optimal
solution, we get

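$$\begin{aligned}
g^\circ &= \operatorname*{argmin}_g \; I(Y; X) - \beta I(S; Y) && (15) \\
&= \operatorname*{argmin}_g \; -H(X) + I(Y; X) - \beta I(S; X) + \beta L_S(X \to Y) && (16) \\
&= \operatorname*{argmin}_g \; -L(X \to Y) + \beta L_S(X \to Y) && (17) \\
&= \operatorname*{argmin}_g \; -L_S(X \to Y) - L_{X|S}(X \to Y) + \beta L_S(X \to Y) && (18) \\
&= \operatorname*{argmin}_g \; (\beta - 1) L_S(X \to Y) - L_{X|S}(X \to Y) && (19)
\end{aligned}$$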
and thus, the optimization problem can be cast as 

$$\min_g \; (\beta - 1) L_S(X \to Y) - L_{X|S}(X \to Y). \tag{20}$$

Note that for large $\beta$ stronger emphasis is placed on
minimizing relevant information loss, and a trivial solution
to this problem is obtained by any bijective g.
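
The equivalence between the IB functional (13) and the loss-based formulation (20) for deterministic g can be checked by brute force on a small alphabet: the two objectives should differ only by the constant $H(X) - \beta I(S; X)$, which does not depend on g. The following sketch enumerates all maps on a ternary alphabet; the joint pmf, the value of `beta`, and the helper names are illustrative assumptions.

```python
from collections import defaultdict
from itertools import product
from math import log2

def entropy(pmf):
    """Shannon entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

def marginal(joint, keep):
    """Marginal pmf of the tuple coordinates listed in `keep`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in keep)] += p
    return out

def objectives(p_sx, g, beta):
    """Return (IB functional (13), loss-based functional (20)) for Y = g[X]."""
    joint = defaultdict(float)                 # pmf over (s, x, y)
    for (s, x), p in p_sx.items():
        joint[(s, x, g[x])] += p
    h = lambda idx: entropy(marginal(joint, idx))
    i_yx = h([2])                              # I(Y;X) = H(Y) for deterministic g
    i_sy = h([0]) + h([2]) - h([0, 2])         # I(S;Y)
    i_sx = h([0]) + h([1]) - h([0, 1])         # I(S;X)
    loss_rel = i_sx - i_sy                     # L_S(X -> Y)
    loss_irr = h([0, 1]) - h([0, 2])           # L_{X|S} = H(X|S) - H(Y|S)
    return i_yx - beta * i_sy, (beta - 1) * loss_rel - loss_irr

# Hypothetical joint pmf of (S, X) with S binary and X ternary.
p_sx = {(0, 0): 0.30, (0, 1): 0.15, (0, 2): 0.05,
        (1, 0): 0.05, (1, 1): 0.15, (1, 2): 0.30}
beta = 3.0

# All 27 deterministic maps g: {0, 1, 2} -> {0, 1, 2}, encoded as tuples.
gaps = []
for g in product(range(3), repeat=3):
    ib, loss_based = objectives(p_sx, g, beta)
    gaps.append(ib - loss_based)
spread = max(gaps) - min(gaps)   # close to zero: the gap does not depend on g
```

Since the gap between the two objectives is the same for every g, both formulations rank all deterministic maps identically, as the chain (15)-(19) suggests.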

In the case of discrete RVs this problem can be circumvented
by another variant of the IB method: The agglomerative
IB method proposed in [13] starts with Y = X and
iteratively merges elements of the state space $\mathcal{Y}$ of Y such
that the relevant information loss is minimized in each step.
By stopping this algorithm as soon as $L_S(X \to Y) > C$ during
some merging step, at least a local optimum of our original
signal enhancement problem (12) can be found.
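
The greedy merging procedure described above can be sketched as follows. This is a simplified illustration and not the algorithm of [13] itself: it evaluates every pairwise merge exhaustively, and it stops just before a merge would push the relevant information loss above the budget (called `max_loss` here, playing the role of C). The toy distribution and all function names are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations
from math import log2

def mutual_information(joint):
    """I(A;B) in bits from a joint pmf given as {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def relevant_loss(p_sx, clusters):
    """L_S(X -> Y) when Y is the index of the cluster that contains X."""
    label = {x: k for k, cluster in enumerate(clusters) for x in cluster}
    p_sy = defaultdict(float)
    for (s, x), p in p_sx.items():
        p_sy[(s, label[x])] += p
    return mutual_information(p_sx) - mutual_information(p_sy)

def agglomerate(p_sx, max_loss):
    """Greedily merge states of Y = X while L_S(X -> Y) stays <= max_loss."""
    clusters = [frozenset([x]) for x in sorted({x for _, x in p_sx})]
    while len(clusters) > 1:
        candidates = []
        for a, b in combinations(clusters, 2):
            merged = [c for c in clusters if c not in (a, b)] + [a | b]
            candidates.append((relevant_loss(p_sx, merged), merged))
        best_loss, best = min(candidates, key=lambda t: t[0])
        if best_loss > max_loss:       # the next merge would exceed the budget
            break
        clusters = best
    return clusters

# Hypothetical toy model: X = (S, B) with independent fair bits S and B.
p_sx = {(s, (s, b)): 0.25 for s in (0, 1) for b in (0, 1)}
clusters = agglomerate(p_sx, max_loss=1e-9)
final_loss = relevant_loss(p_sx, clusters)
```

For this toy model the procedure ends with two clusters, one per value of S: the nuisance bit is removed while the relevant information is fully preserved.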

¹ In some cases it may be beneficial to minimize the relevant information
loss subject to a minimum reduction of irrelevant information, or even to use a
variational formulation.

A step further has been made by the authors of [14],
who extended the IB method to incorporate knowledge about
irrelevant signal components. They accompany the relevant
information S by an irrelevance variable $\bar{S}$, assuming conditional
independence between S and $\bar{S}$ given X, and minimize
the following functional:

$$I(Y; X) - \beta \left[ I(Y; S) - \gamma I(Y; \bar{S}) \right] \tag{21}$$

At the same time, the entropy of the system output Y and the
irrelevant information $I(Y; \bar{S})$ are minimized, and the relevant
information I(Y; S) is maximized. As in IB, $\beta$ and $\gamma$ are the
weights for these three conflicting goals. We can again use
our notions of relevant and, employing Corollary 2, irrelevant
information loss to rewrite the optimization problem. If we
restrict ourselves to clustering, we obtain

$$\min_g \; (\beta - 1) L_S(X \to Y) - (\beta\gamma + 1) L_{X|S}(X \to Y) \tag{22}$$

where again all constant terms have been dropped. Note that
here X|S takes the place of $\bar{S}$, automatically fulfilling the
requirement of conditional independence in [14].

Unlike for IB, by employing side information it does make
sense to let $\beta \to \infty$: Compression emerges naturally from
maximizing the irrelevant information loss (due to $\gamma > 0$).
This is exactly the approach that has been taken up by the
authors of [7], [8], who introduced the following information
processing measure for discrete RVs X, Y, and S:

$$\Delta P(X \to Y|S) = H(X|S) - H(Y|S) - \alpha \left( H(S|Y) - H(S|X) \right) \tag{23}$$

The authors argue that the first difference in (23) corresponds
to complexity reduction (i.e., reduction of irrelevant information),
while the second term accounts for loss of relevant
information. Indeed, with

$$L_{X|S}(X \to Y) = I(X; X|Y, S) = H(X|Y, S) \tag{24}$$
$$= H(X|S) - H(Y|S) \tag{25}$$

we can rewrite $\Delta P(X \to Y|S)$ with our notions of relevant
information loss and state a variational utility function (to be
maximized) for our signal enhancement problem (12):

$$\Delta P(X \to Y|S) = L_{X|S}(X \to Y) - \alpha L_S(X \to Y) \tag{26}$$

Here, $\alpha > 0$ is a design parameter trading between the loss
of relevant and irrelevant information.

Aside from the IB method and its variants widely used 
in machine learning, many other functions and algorithms 
inherently solve our problem of signal enhancement: Quan- 
tizers used in digital communication systems and regener- 
ative repeaters have the goal to preserve the information- 
carrying (discrete) RV as much as possible while removing 
continuous-valued noise. Bandpass filters (although not yet di- 
rectly tractable using our notions of relevant information loss) 
remove out-of-band noise while leaving the information signal 
unaltered. And finally, methods for dimensionality reduction 
(such as principal component analysis) remove redundancy
and irrelevance while trying to preserve the interesting part of 
the observation. In the next section we focus on exactly this 
class of methods. 



V. Optimality of the PCA in terms of Relevant 
Information Loss 

We now analyze the relevant information loss of the principal
component analysis (PCA). In this regard, we initially
follow the reasoning of Plumbley [4], who showed that the
PCA in some cases minimizes the relevant information loss.
We then generalize Plumbley's results and even show that this
loss can vanish. To this end, we define the PCA as

$$Y_M = g(X) = I_M Y = I_M W^T X \tag{27}$$

where X is an N-dimensional continuous-valued input RV,
$Y_M$ an M-dimensional output RV (composed of the first M
elements of Y), and W the matrix of eigenvectors of the
covariance matrix of X, $C_X = E\{X X^T\}$. Thus, det W = 1
and $W^{-1} = W^T$. Furthermore, let $I_M$ be an (M x N)-matrix
with ones on the main diagonal. The PCA is here used for
dimensionality reduction and, assuming that one has perfect
knowledge of the rotation matrix W, the relative information
loss equals (N - M)/N, while the absolute information loss is
infinite [2].

It is known that the PCA minimizes the mean squared error
for a reconstruction $\hat{X}_M = W I_M^T Y_M$ of X. In addition to
that, as Linsker pointed out in [11], given that X is an observation
of a Gaussian RV S corrupted by Gaussian noise, the PCA
maximizes the mutual information I(S; Y). For non-Gaussian 
S (and Gaussian noise with spherical symmetry), the PCA not 
only provides an upper bound on the mutual information, but 
also an upper bound on the relevant information loss PI, Il5l. 

We now want to generalize the results such that non-iid and
non-Gaussian noise is taken into account as well. In particular,
we show that the relevant information loss can indeed vanish
under some circumstances. Let

$$X = S + N \tag{28}$$

where S and N are the relevant information and the noise,
respectively, with covariance matrices $C_S$ and $C_N$. We further
assume that S and N are independent and, as a consequence,
$C_X = C_S + C_N$. Let $\{\lambda_i\}$, $\{\mu_i\}$, and $\{\nu_i\}$ be the sets
of eigenvalues of $C_X$, $C_N$, and $C_S$, respectively. Let the
eigenvalues be ordered descendingly, i.e.,

$$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_N. \tag{29}$$

We now write

$$Y_M = I_M W^T X = I_M W^T S + I_M W^T N = \tilde{S}_M + \tilde{N}_M. \tag{30}$$

As mentioned before, $Y_M$ is composed of the first M elements
of the vector Y. Conversely, we let $Y_c$, $\tilde{S}_c$, and $\tilde{N}_c$ denote
the last N - M elements of the corresponding vectors. If, as
in our case, the orthogonal matrix W performs the PCA, the
covariance matrix of $Y_M$ is a diagonal matrix with the M
largest eigenvalues of $C_X$. We now present

Lemma 1. For the above signal model, the relevant information
loss in the PCA is given by

$$L_S(X \to Y_M) = h(Y_c|Y_M) - h(\tilde{N}_c|\tilde{N}_M). \tag{31}$$

Proof: See Appendix. ■

Before proceeding, we need the following 

Definition 2. Let X and Y be continuous RVs with arbitrary
continuous (joint) distribution, and let $X_G$ and $Y_G$ be Gaussian
RVs with the same (joint) first and second moments. We define
the conditional divergence

$$J(X|Y) = D(X|Y \,\|\, X_G|Y_G) = h(X_G|Y_G) - h(X|Y). \tag{32}$$

Clearly, this quantity inherits all its properties from the
Kullback-Leibler divergence, e.g., non-negativity, and can be
considered as a measure of Gaussianity. We want to point out
that despite the similarity, this quantity is not to be confused
with negentropy. For negentropy, instead of using a jointly
Gaussian distribution for $(X_G, Y_G)$ for computing $h(X_G|Y_G)$,
one uses a Gaussian distribution for $X_G|Y = y$ to compute
$h(X_G|Y = y)$ before taking the expectation w.r.t. the true
marginal distribution of Y.

While most results about the optimality of the PCA are 
restricted to the Gaussian case, we now employ conditional 
divergence to generalize some of these results. In particular, 
we maintain 

Theorem 1. Let X = S + N, with N and S independent, and
let $Y_M$ be obtained by performing dimensionality-reducing
PCA. If N is iid (i.e., $C_N$ is a scaled identity matrix) and
more Gaussian than Y in the sense

$$J(\tilde{N}_c|\tilde{N}_M) \le J(Y_c|Y_M) \tag{33}$$

then the PCA minimizes the Gaussian upper bound on the
relevant information loss $L_S(X \to Y_M)$.

Proof: See Appendix. ■ 
This theorem essentially generalizes the result by Plumbley
[4], [5], who claimed that the PCA minimizes the Gaussian
upper bound on the information loss for spherical Gaussian
noise (i.e., for Gaussian N with $C_N$ being a scaled identity
matrix). In addition to that, it helps justify our use of
information loss instead of information transfer; an upper bound
on the latter would not be useful in our signal enhancement
problem.

As a next step, we restrict ourselves to relevant information
which is concentrated on an $L \le M$-dimensional subspace,
but drop the requirement that N is iid. We still assume,
however, that $C_N$ (and, thus, $C_X$) is full rank. Note that due
to these assumptions $\lambda_i > 0$ and $\mu_i > 0$ for all i, while $\nu_i = 0$
for i > L. We are now ready to state

Theorem 2 (Bounds for the PCA). Assume that S has
covariance matrix $C_S$ with at most rank M, and assume that
N is independent of S and has (full-rank) covariance matrix
$C_N$. Let further N be more Gaussian than Y in the sense

$$J(\tilde{N}_c|\tilde{N}_M) \le J(Y_c|Y_M) \tag{34}$$

where $Y_M$ is obtained by employing the PCA for dimensionality
reduction. Then, the relevant information loss is bounded

where $\{\mu_i\}$ is the set of decreasing eigenvalues of $C_N$ and
where ln is the natural logarithm.

Proof: See Appendix. ■

Note that this upper bound is non-negative since, by
assumption on the ordering of the eigenvalues, any term in
the product cannot be smaller than one.

Fig. 4. Digital Communication System: The input signal S is uniformly
distributed on {-1, 1}, the channel adds uniform noise to the signal. The
channel output, X, is quantized using a set of thresholds $\{\gamma_i\}$ to obtain the
decision variable Y.

We finally show that there are cases where, despite the fact
that we reduce the dimensionality of the data, we are still able
to preserve all of the relevant information:

Corollary 3. Assume that S has covariance matrix $C_S$ with at
most rank M, and assume that N is zero-mean Gaussian noise
independent of S with covariance matrix $\sigma_N^2 I$. Let $Y_M$ be
obtained by employing the PCA for dimensionality reduction.
Then, the relevant information loss $L_S(X \to Y_M)$ vanishes.

Proof: The proof follows from the fact that, as $C_N = \sigma_N^2 I$,
all eigenvalues are equal, i.e., $\mu_1 = \dots = \mu_N = \sigma_N^2$. ■

As mentioned before, due to dimensionality reduction the
absolute information loss $L(X \to Y_M)$ is infinite; a direct
consequence is that the irrelevant information loss, $L_{X|S}(X \to Y_M)$,
is infinite as well. Given that the assumptions of Corollary 3
hold, the PCA is a good solution to our signal enhancement
problem.

VI. Examples 

We now try to illustrate a few of our results with the help 
of some examples. In particular, we analyze a communication 
channel with binary input and a dimensionality-reducing PCA 
for non-Gaussian data. Unless otherwise noted, log denotes 
the binary logarithm. 

A. Communication channel with uniform noise 

Consider the digital communication system depicted in
Fig. 4. Let the data symbols S be uniformly distributed on
{-1, 1}, thus H(S) = 1. Let further N be a noise signal
which, for the sake of simplicity, is uniformly distributed on
[-a, a], a > 1. Clearly, X = S + N is a continuous-valued
signal with infinite entropy. During quantization an infinite
amount of information is lost, $L(X \to Y) = \infty$ [1]. The
main point here is that most of the information at the input
of the quantizer is information about the noise signal N. We
will now make this statement precise.

With the differential entropy of X given by

$$h(X) = \frac{1}{a} \log 4a + \frac{a-1}{a} \log 2a \tag{36}$$

and with h(X|S) = log 2a we get the amount of relevant
information available at the input of the quantizer:

$$I(X; S) = \frac{1}{a} \tag{37}$$

Choosing now a single quantizer threshold $\gamma_1 = 0$, i.e.,
Y = sgn(X), we obtain a binary symmetric channel with
cross-over probability $P_e = \frac{a-1}{2a}$. With $H_2(\cdot)$ being the binary
entropy function, the mutual information thus computes to
$I(Y; S) = 1 - H_2(P_e)$ (see, e.g., [10]) and we obtain a relevant
information loss of

$$L_S(X \to Y) = H_2\!\left(\frac{a-1}{2a}\right) - \frac{a-1}{a}. \tag{38}$$

Conversely, if we use two quantizer thresholds $\gamma_1 = 1 - a$
and $\gamma_2 = a - 1$ we obtain a ternary output RV Y. Interpreting
any value $\gamma_1 < X < \gamma_2$ as an erasure, we obtain a binary
erasure channel with erasure probability $\frac{a-1}{a}$. The corresponding
mutual information is computed as $I(Y; S) = \frac{1}{a}$ (see,
e.g., [10]), and the relevant information loss vanishes.

As this example shows, while our system actually destroys 
an infinite amount of information (to be precise, exactly 100% 
of the available information [2]), the relevant information loss
can still be zero. Signal enhancement, in the sense of removing 
irrelevant information, was thus successful. 
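
The two quantizers of this example can be compared numerically. The sketch below computes I(S; Y) from the transition probabilities of the quantized channel and subtracts it from I(S; X) = 1/a, taken from (37). The value a = 2 and the function names are illustrative assumptions; the thresholds are those discussed above.

```python
from math import log2

def cell_probabilities(s, a, thresholds):
    """P(Y = k | S = s) for X = s + N, N ~ U(-a, a), quantized at `thresholds`."""
    edges = [float("-inf")] + sorted(thresholds) + [float("inf")]
    lo, hi = s - a, s + a                      # support of X given S = s
    return [max(0.0, min(hi, right) - max(lo, left)) / (2 * a)
            for left, right in zip(edges, edges[1:])]

def information_about_s(a, thresholds):
    """I(S;Y) in bits for S uniform on {-1, +1}."""
    rows = {s: cell_probabilities(s, a, thresholds) for s in (-1, 1)}
    p_y = [0.5 * (p + q) for p, q in zip(rows[-1], rows[1])]
    return sum(0.5 * p * log2(p / q)
               for row in rows.values() for p, q in zip(row, p_y) if p > 0)

def relevant_loss(a, thresholds):
    """L_S(X -> Y) = I(S;X) - I(S;Y), using I(S;X) = 1/a from (37)."""
    return 1.0 / a - information_about_s(a, thresholds)

a = 2.0                                        # hypothetical noise amplitude
loss_sign = relevant_loss(a, [0.0])            # Y = sgn(X), cf. (38)
loss_erasure = relevant_loss(a, [1 - a, a - 1])  # ternary erasure quantizer
```

For a = 2, the single-threshold quantizer loses about 0.31 bit of the 0.5 bit of relevant information available at its input, in agreement with (38), while the two-threshold quantizer loses none.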

In this example it is tempting to identify the noise variable
N with the irrelevant information X|S. However, this does not
necessarily lead to a correct result, as we show next.

To this end, we substitute the quantizer in Fig. 4 by a magnitude
device, i.e., Y' = |X|. By the fact that the probability
density function of X has even symmetry, it can be shown 
that L(X -> Y') = H{X\Y') = 1 (e.g., Q3). One can further 
show that, since the marginal distribution of Y' coincides with
the conditional distributions of Y'|S = -1 and Y'|S = 1, the
mutual information I(Y'; S) = 0. Thus, $L_S(X \to Y') = \frac{1}{a}$.
As an immediate consequence, $L_{X|S}(X \to Y') = \frac{a-1}{a}$.
Let us now determine the information loss relevant w.r.t.
N. By showing that $L_N(X \to Y') \ne L_{X|S}(X \to Y')$ we
argue that noise and irrelevant information are not necessarily
identical. Observe that

$$L_N(X \to Y') = I(X; N|Y') \tag{39}$$
$$= H(X|Y') - H(X|N, Y') \tag{40}$$
$$= 1 - H(S + N|N, Y') \tag{41}$$
$$= 1 - H(S|N, |S + N|). \tag{42}$$

Given we know N, S is uncertain only if |S + N| yields the
same value for both S = 1 and S = -1. In other words, we
have to require that |N - 1| = |N + 1|. Squaring both sides
translates this to requiring N = -N, which is fulfilled only
for N = 0. Since Pr(N = 0) = 0 (N is a continuous RV),
we automatically obtain

$$L_N(X \to Y') = 1 \ne \frac{a-1}{a} = L_{X|S}(X \to Y'). \tag{43}$$

Indeed, the reason why we cannot identify the noise with
the irrelevant information can be related to N and S not
being conditionally independent given X. We summarize these
findings in Table I.

TABLE I
Information loss (relevant and irrelevant) for the
communication channel with uniform noise.

Loss               Y = sgn(X)                    Y = |X|
L(X -> Y)          ∞                             1
L_S(X -> Y)        H_2((a-1)/(2a)) - (a-1)/a     1/a
L_{X|S}(X -> Y)    ∞                             (a-1)/a
L_N(X -> Y)        ∞                             1
L_{X|N}(X -> Y)

B. PCA with non-Gaussian data 

Assume we observe two independent data sources, $S_1$ and
$S_2$, with three sensors which are corrupted by independent,
unit-variance Gaussian noise $N_i$, i = 1, 2, 3. Our sensor
signals shall be defined as

$$X_1 = S_1 + N_1, \tag{44}$$
$$X_2 = S_1 + S_2 + N_2, \text{ and} \tag{45}$$
$$X_3 = S_2 + N_3. \tag{46}$$

We assume further that our data sources have variances $\sigma_1^2$ and
$\sigma_2^2$ and are non-Gaussian, but that they still can be described by
a (joint) probability density function. We obtain the covariance
matrix of $X = [X_1, X_2, X_3]$ as

$$C_X = \begin{bmatrix} \sigma_1^2 + 1 & \sigma_1^2 & 0 \\ \sigma_1^2 & \sigma_1^2 + \sigma_2^2 + 1 & \sigma_2^2 \\ 0 & \sigma_2^2 & \sigma_2^2 + 1 \end{bmatrix}. \tag{47}$$

Performing the eigenvalue decomposition yields three eigenvalues,

$$\{\lambda_1, \lambda_2, \lambda_3\} = \{\sigma_1^2 + \sigma_2^2 + 1 + C, \; \sigma_1^2 + \sigma_2^2 + 1 - C, \; 1\} \tag{48}$$

where $C = \sqrt{\sigma_1^4 + \sigma_2^4 - \sigma_1^2 \sigma_2^2}$.

We now reduce the dimension of the output vector Y from
N = 3 to M = 2 by dropping the component corresponding
to the smallest eigenvalue. Using Lemma 1 we now give an
upper bound on the relevant loss, $L_S(X \to Y_M)$. To this end,
note that N is iid Gaussian, and thus $h(\tilde{N}_c|\tilde{N}_M) = h(\tilde{N}_c)$.
By assumption, $C_N = I$, and by the orthogonality of the
transform,

$$h(\tilde{N}_c) = \frac{1}{2} \ln(2\pi e). \tag{49}$$

Conversely, we know that

$$h(Y_c|Y_M) \le h(Y_{c,G}|Y_{M,G}) = h(Y_{c,G}) = \frac{1}{2} \ln(2\pi e \lambda_3). \tag{50}$$

With $\lambda_3 = 1$ and Lemma 1 we thus get

$$L_S(X \to Y_M) \le 0. \tag{51}$$

The relevant information loss vanishes. Indeed, the eigenvector
corresponding to the smallest eigenvalue is given as
$p_3 = \frac{1}{\sqrt{3}} [1, -1, 1]^T$; thus, for $Y_3$ one would obtain

$$Y_3 = \frac{X_1 + X_3 - X_2}{\sqrt{3}} = \frac{N_1 + N_3 - N_2}{\sqrt{3}}. \tag{52}$$

Since this component does not contain any signal portion, 
dropping it does not lead to a loss of relevant information. 
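
The eigenstructure claimed in (47), (48), and (52) can be verified numerically for concrete source variances. The following sketch checks that $p_3$ is an eigenvector of $C_X$ with eigenvalue 1, that the three values in (48) are roots of the characteristic polynomial, and that the Gaussian bound from (49) and (50) evaluates to zero. The variances 2 and 0.5 as well as the helper names are hypothetical choices made for illustration.

```python
from math import log, sqrt

def covariance(var1, var2):
    """C_X from (47) for source variances var1, var2 and unit-variance noise."""
    return [[var1 + 1, var1, 0.0],
            [var1, var1 + var2 + 1, var2],
            [0.0, var2, var2 + 1]]

def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def char_poly(m, lam):
    """det(M - lam * I) for a 3x3 matrix, by cofactor expansion."""
    a = [[m[i][j] - (lam if i == j else 0.0) for j in range(3)] for i in range(3)]
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

var1, var2 = 2.0, 0.5                           # hypothetical source variances
c_x = covariance(var1, var2)
c = sqrt(var1**2 + var2**2 - var1 * var2)
eigenvalues = [var1 + var2 + 1 + c, var1 + var2 + 1 - c, 1.0]   # claimed in (48)
p3 = [1 / sqrt(3), -1 / sqrt(3), 1 / sqrt(3)]

# C_X p3 should equal 1 * p3, and each claimed eigenvalue should be a root.
eigvec_residual = max(abs(x - y) for x, y in zip(matvec(c_x, p3), p3))
poly_residuals = [abs(char_poly(c_x, lam)) for lam in eigenvalues]
gaussian_bound = 0.5 * log(eigenvalues[2] / 1.0)   # (50) minus (49), in nats
```

Because the smallest eigenvalue of $C_X$ equals the noise variance, the Gaussian bound is zero regardless of the source variances chosen here.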

Note that in case the noise sources $N_i$ do not have the
same variances, the application of PCA may lead to a loss
of information, even though the relevant information is still
concentrated on a subspace of lower dimensionality. We make
this precise in the next example.

C. PCA with large noise variances 

We now again assume that we use three sensors to observe 
two independent, non-Gaussian data sources which are cor- 
rupted by independent Gaussian noise: 



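\begin{align}
X_1 &= S_1 + N_1, \tag{53} \\
X_2 &= S_2 + N_2, \text{ and} \tag{54} \\
X_3 &= N_3. \tag{55}
\end{align}

This time, however, we assume that the data sources have unit
variance, and that the variance of noise source $N_i$ is $i$, thus
$\{\mu_1, \mu_2, \mu_3\} = \{3, 2, 1\}$. With this, the covariance matrix of
$\mathbf{X}$ is given as

\begin{equation}
\mathbf{C}_X = \begin{bmatrix} 1 + 1 & 0 & 0 \\ 0 & 1 + 2 & 0 \\ 0 & 0 & 3 \end{bmatrix}
= \begin{bmatrix} 2 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 3 \end{bmatrix} \tag{56}
\end{equation}

and has eigenvalues $\{\lambda_1, \lambda_2, \lambda_3\} = \{3, 3, 2\}$.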

Since $\mathbf{C}_X$ is already diagonal, the PCA only leads to an
ordering w.r.t. the eigenvalues; dropping the component of $\mathbf{Y}$
corresponding to the smallest eigenvalue, i.e., dimensionality
reduction from $N = 3$ to $M = 2$, yields $Y_1 = N_3$ and
$Y_2 = S_2 + N_2$. Since we do not have access to $S_1$ anymore,
information is lost. In fact, PCA suggested dropping the most
informative component of $\mathbf{X}$ (the one with the highest SNR).

With Lemma 1 we can now compute the Gaussian upper
bound on the relevant information loss. Following the proof
of Theorem 1,

\begin{equation}
L_S(\mathbf{X} \to \mathbf{Y}_M) \le h(Y_{3,G}) - h(N_1) = \frac{1}{2} \ln 2. \tag{57}
\end{equation}

The bound obtained from Theorem 2 is loose here, evaluating
to $L_S(\mathbf{X} \to \mathbf{Y}_M) \le \frac{1}{2} \ln 3$.

Since in this case the noise is not iid, the condition of
Theorem 1 is not fulfilled; thus the Gaussian upper bound
is not minimized by the PCA. Indeed, by preserving $X_1$ and
$X_2$ (and dropping $X_3$) the relevant information loss vanishes.
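The numbers in this example can be reproduced with a few lines of NumPy. This is a sketch under stated assumptions: because the covariance matrix is diagonal and the noise components are independent, the Gaussian upper bound for dropping a set of coordinates is taken as the Gaussian entropy of the dropped sensor signals minus the entropy of the dropped noise. The Theorem 2 value is assumed to take the form $\frac{1}{2}\ln(\mu_1/\mu_3)$ for a single dropped component, which matches the value $\frac{1}{2}\ln 3$ quoted above. The helper names are hypothetical.

```python
import numpy as np

def gaussian_entropy(variances):
    """Differential entropy (nats) of independent Gaussians with given variances."""
    v = np.asarray(variances, dtype=float)
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e * v))

signal_var = np.array([1.0, 1.0, 0.0])  # S1, S2 have unit variance; X3 has no signal
noise_var = np.array([1.0, 2.0, 3.0])   # variance of N_i is i
sensor_var = signal_var + noise_var      # diagonal of C_X: [2, 3, 3]

def gaussian_bound(dropped):
    """Gaussian upper bound on the loss when the listed coordinates are dropped."""
    return gaussian_entropy(sensor_var[dropped]) - gaussian_entropy(noise_var[dropped])

pca_bound = gaussian_bound([0])     # PCA drops X1, the smallest-variance sensor
better_bound = gaussian_bound([2])  # dropping the pure-noise sensor X3 instead
theorem2_bound = 0.5 * np.log(noise_var.max() / noise_var.min())  # assumed form

print(pca_bound, better_bound, theorem2_bound)
```

Under these assumptions the bound for the PCA choice equals $\frac{1}{2}\ln 2$, the bound for dropping $X_3$ is zero, and the assumed Theorem 2 value is the loosest of the three.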



VII. Conclusion 

In this work we presented the notion of relevant information 
loss and analyzed many of its elementary properties. We 
argued that relevant information loss is a central quantity in 
the problem of signal enhancement and thus, of system theory 
in general. A comparison with the literature about machine 
learning and neural networks revealed that many of the algo- 
rithms introduced there can be reformulated as solutions for 
the signal enhancement problem using the notion of relevant 
information loss. As an example, we discussed principal component
analysis used for dimensionality reduction and derived
conditions under which the relevant information loss vanishes. 

Future work will concentrate on different blocks of the 
signal processing chain, such as quantizers, sampling devices, 
and filters as well as on investigating a possible connection 
between relevant information loss and state space aggregation. 

Appendix 

A. Proof of Proposition 2

The proof follows along the same lines as the proof of the 
data processing inequality for mutual information (see [10]). 
We first expand $I(X; W, V|Y)$ as follows:

\begin{align}
I(X; W, V|Y) &= I(X; V|Y) + I(X; W|V, Y) \tag{58} \\
&= I(X; W|Y) + I(X; V|W, Y) \tag{59} \\
&= I(X; W|Y) \tag{60}
\end{align}

where the last line follows from the fact that $V$ and $X$ are
conditionally independent given $W$.
Using now Definition 1 we obtain

\begin{align}
L_W(X \to Y) &= I(X; W|Y) \tag{61} \\
&= I(X; V|Y) + I(X; W|V, Y) \tag{62} \\
&= L_V(X \to Y) + I(X; W|V, Y) \tag{63} \\
&\ge L_V(X \to Y) \tag{64}
\end{align}

which completes the proof. ■ 
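The decomposition in (61)-(64) can be illustrated on a small discrete toy model. The sketch below is an assumption-laden illustration rather than part of the proof: it draws a random Markov chain $V - W - X$ over finite alphabets, sets $Y = g(X)$ with the arbitrary choice $g(x) = \lfloor x/2 \rfloor$, and evaluates the conditional mutual informations directly from the joint pmf. All helper names and alphabet sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice

def random_stochastic(rows, cols):
    """Random row-stochastic matrix, used here as a conditional pmf."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

def cond_mutual_info(p_abc):
    """I(A;B|C) in nats for a joint pmf array indexed as [a, b, c]."""
    p_c = p_abc.sum(axis=(0, 1), keepdims=True)
    p_ac = p_abc.sum(axis=1, keepdims=True)
    p_bc = p_abc.sum(axis=0, keepdims=True)
    mask = p_abc > 0
    ratio = (p_abc * p_c)[mask] / (p_ac * p_bc)[mask]
    return float(np.sum(p_abc[mask] * np.log(ratio)))

n_v, n_w, n_x, n_y = 3, 4, 6, 3

# Joint pmf of the Markov chain V - W - X.
p_v = rng.random(n_v)
p_v /= p_v.sum()
p_vwx = np.einsum("v,vw,wx->vwx", p_v,
                  random_stochastic(n_v, n_w), random_stochastic(n_w, n_x))

# Deterministic system Y = g(X) with g(x) = x // 2.
g = np.arange(n_x) // 2
y_indicator = np.eye(n_y)[g]  # shape (n_x, n_y), one-hot rows
p_vwxy = p_vwx[..., None] * y_indicator[None, None, :, :]

loss_W = cond_mutual_info(p_vwxy.sum(axis=0).transpose(1, 0, 2))  # I(X;W|Y)
loss_V = cond_mutual_info(p_vwxy.sum(axis=1).transpose(1, 0, 2))  # I(X;V|Y)
# I(X;W|V,Y): merge (V, Y) into a single conditioning index.
gap = cond_mutual_info(
    p_vwxy.transpose(2, 1, 0, 3).reshape(n_x, n_w, n_v * n_y))

print(loss_W, loss_V, gap)
```

In this toy model `loss_W` equals `loss_V + gap` up to rounding, mirroring (63), and the non-negativity of `gap` gives the inequality (64).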

B. Proof of Corollary 1

If $f(S) - S - X - Y$ is a Markov chain, the proof follows
immediately from Proposition 2. In case $S - f(S) - X - Y$,
one gets with the proof of Proposition 2

\begin{align}
L_{f(S)}(X \to Y) &= L_S(X \to Y) + I(X; f(S)|S, Y) \tag{65} \\
&= L_S(X \to Y). \tag{66}
\end{align}

Thus, in this case Proposition 2 is shown to hold with equality.

Furthermore, using Definition Q] the Corollary follows im- 
mediately from [15 Thm. 3.7.1]. ■ 

C. Proof of Proposition 4

For the proof, note that with Definition 1 we get

\begin{align}
L_S(X \to Z) &\overset{(a)}{=} I(S; X, Y|Z) \tag{67} \\
&\overset{(b)}{=} I(S; X|Y, Z) + I(S; Y|Z) \tag{68} \\
&= I(S; X|Y) + I(S; Y|Z) \tag{69}
\end{align}

where (a) is due to the fact that $Y = g(X)$ and $Z = h(Y)$,
respectively, and (b) is the chain rule of information. ■



D. Proof of Lemma 1

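We start by noting that $L_S(\mathbf{X} \to \mathbf{Y}_M) = L_S(\mathbf{Y} \to \mathbf{Y}_M)$
since $\mathbf{Y}$ and $\mathbf{X}$ are related by an invertible transform. Thus,

\begin{align}
L_S(\mathbf{Y} \to \mathbf{Y}_M) &= I(\mathbf{Y}; \mathbf{S}) - I(\mathbf{Y}_M; \mathbf{S}) \tag{70} \\
&= h(\mathbf{Y}) - h(\mathbf{Y}|\mathbf{S}) - h(\mathbf{Y}_M) + h(\mathbf{Y}_M|\mathbf{S}) \notag \\
&= h(\mathbf{Y}_c|\mathbf{Y}_M) - h(\mathbf{W}^T \mathbf{S} + \mathbf{W}^T \mathbf{N}|\mathbf{S}) \notag \\
&\quad + h(\mathbf{I}_M \mathbf{W}^T \mathbf{S} + \mathbf{I}_M \mathbf{W}^T \mathbf{N}|\mathbf{S}) \tag{71} \\
&\overset{(a)}{=} h(\mathbf{Y}_c|\mathbf{Y}_M) - h(\mathbf{W}^T \mathbf{N}) + h(\mathbf{I}_M \mathbf{W}^T \mathbf{N}) \notag \\
&= h(\mathbf{Y}_c|\mathbf{Y}_M) - h(\mathbf{N}_c|\mathbf{N}_M) \tag{72}
\end{align}

where (a) is due to independence of $\mathbf{S}$ and $\mathbf{N}$. This completes
the proof. ■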

E. Proof of Theorem 1

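By Lemma 1 and Definition 2,

\begin{align}
L_S(\mathbf{X} \to \mathbf{Y}_M) &= h(\mathbf{Y}_c|\mathbf{Y}_M) - h(\mathbf{N}_c|\mathbf{N}_M) \tag{73} \\
&= h(\mathbf{Y}_{c,G}|\mathbf{Y}_{M,G}) - J(\mathbf{Y}_c|\mathbf{Y}_M) \notag \\
&\quad - h(\mathbf{N}_{c,G}|\mathbf{N}_{M,G}) + J(\mathbf{N}_c|\mathbf{N}_M) \tag{74} \\
&\le h(\mathbf{Y}_{c,G}|\mathbf{Y}_{M,G}) - h(\mathbf{N}_{c,G}|\mathbf{N}_{M,G}) \tag{75} \\
&\overset{(a)}{=} h(\mathbf{Y}_{c,G}) - h(\mathbf{N}_{c,G}|\mathbf{N}_{M,G}) \tag{76} \\
&\overset{(b)}{=} h(\mathbf{Y}_{c,G}) - h(\mathbf{N}_{c,G}) \tag{77} \\
&= \frac{1}{2} \ln \left( \prod_{i=M+1}^{N} \frac{\lambda_i}{\mu} \right). \tag{78}
\end{align}

Here, (a) is due to the fact that the PCA decorrelates the output
data $\mathbf{Y}$ and thus leads to independence of $\mathbf{Y}_{c,G}$ and $\mathbf{Y}_{M,G}$ (in
the sense of Definition 0). By similar reasons (b) follows from
the fact that $\mathbf{N}$ is iid ($\mathbf{C}_N$ is a scaled identity matrix, with all
eigenvalues being equal, $\mu_i = \mu$). Since the PCA preserves the
$M$ coordinates corresponding to the $M$ largest eigenvalues of
$\mathbf{C}_Y$, the last line (obtained with [10, Thm. 8.4.1] and [16,
Fact 5.10.14]) represents the smallest Gaussian upper bound
and completes the proof. ■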

F. Proof of Theorem 2

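We recapitulate (76) from the proof of Theorem 1 and get

\begin{align}
L_S(\mathbf{X} \to \mathbf{Y}_M) &\le h(\mathbf{Y}_{c,G}) - h(\mathbf{N}_{c,G}|\mathbf{N}_{M,G}) \tag{79} \\
&= h(\mathbf{Y}_{c,G}) - h(\mathbf{N}_G) + h(\mathbf{N}_{M,G}). \tag{80}
\end{align}

With [10, Thm. 8.4.1] and [16, Fact 5.10.14] we get

\begin{equation}
h(\mathbf{N}_G) = \frac{1}{2} \ln \left( (2\pi e)^N \prod_{i=1}^{N} \mu_i \right) \tag{81}
\end{equation}

and

\begin{equation}
h(\mathbf{Y}_{c,G}) = \frac{1}{2} \ln \left( (2\pi e)^{N-M} \prod_{i=M+1}^{N} \lambda_i \right). \tag{82}
\end{equation}

If now $\mathbf{C}_{N_M}$ denotes the $(M \times M)$-covariance matrix of $\mathbf{N}_M$
(and, thus, of $\mathbf{N}_{M,G}$) and $\{\tilde{\mu}_i\}$ the set of eigenvalues of $\mathbf{C}_{N_M}$,
we obtain

\begin{equation}
L_S(\mathbf{X} \to \mathbf{Y}_M) \le \frac{1}{2} \ln \left( \frac{\prod_{i=M+1}^{N} \lambda_i \prod_{i=1}^{M} \tilde{\mu}_i}{\prod_{i=1}^{N} \mu_i} \right). \tag{83}
\end{equation}

We now complete the proof by providing upper bounds
on the eigenvalues in the numerator. It is easy to verify
that $\mathbf{C}_{N_M}$ is the top left principal submatrix of $\mathbf{W}^T \mathbf{C}_N \mathbf{W}$
(which, by the orthogonality of $\mathbf{W}$, has the same eigenvalues as
$\mathbf{C}_N$). As a consequence, we can employ Cauchy's interlacing
inequality [16, Thm. 8.4.5]:

\begin{equation}
\mu_{i+N-M} \le \tilde{\mu}_i \le \mu_i. \tag{84}
\end{equation}

The second bound, $\lambda_i \le \mu_1$, is derived from Weyl's inequality
[16, Thm. 8.4.11]

\begin{equation}
\lambda_i \le \nu_i + \mu_1 \tag{85}
\end{equation}

and by noticing that $\nu_j = 0$ for all $j > M$.

Combining all this, we obtain as an upper bound on the
information loss

\begin{align}
L_S(\mathbf{X} \to \mathbf{Y}_M) &\le \frac{1}{2} \ln \left( \frac{\prod_{i=M+1}^{N} \lambda_i \prod_{i=1}^{M} \tilde{\mu}_i}{\prod_{i=1}^{N} \mu_i} \right) \tag{86} \\
&\le \frac{1}{2} \ln \left( \frac{\prod_{i=M+1}^{N} \mu_1 \prod_{i=1}^{M} \mu_i}{\prod_{i=1}^{N} \mu_i} \right) \tag{87} \\
&= \frac{1}{2} \ln \left( \prod_{i=M+1}^{N} \frac{\mu_1}{\mu_i} \right). \tag{88}
\end{align}

This completes the proof. ■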
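The two eigenvalue inequalities used above can be checked numerically on a random instance. The following sketch assumes NumPy, a randomly generated positive definite noise covariance, and a rank-$M$ signal covariance. It assumes that $\nu_i$ denotes the eigenvalues of the signal part of the covariance, and it uses $\tilde{\mu}_i$ for the eigenvalues of the retained noise covariance. All variable names and the dimensions $N = 5$, $M = 2$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, arbitrary choice
N, M = 5, 2

def eig_desc(mat):
    """Eigenvalues and eigenvectors of a symmetric matrix, largest first."""
    vals, vecs = np.linalg.eigh(mat)
    return vals[::-1], vecs[:, ::-1]

B = rng.standard_normal((N, N))
C_N = B @ B.T + 0.1 * np.eye(N)   # hypothetical noise covariance (positive definite)
A = rng.standard_normal((N, M))
C_S = A @ A.T                     # signal covariance of rank M
C_X = C_S + C_N

lam, W = eig_desc(C_X)            # PCA of the sensor signals
mu, _ = eig_desc(C_N)
nu, _ = eig_desc(C_S)
# Covariance of the M retained noise coordinates: top-left block of W^T C_N W.
mu_tilde, _ = eig_desc((W.T @ C_N @ W)[:M, :M])

tol = 1e-9
# Cauchy interlacing (0-based indices): mu[i+N-M] <= mu_tilde[i] <= mu[i].
interlacing_ok = all(
    mu[i + N - M] - tol <= mu_tilde[i] <= mu[i] + tol for i in range(M))
# Weyl: lam[i] <= nu[i] + mu[0], and nu[i] vanishes for i >= M.
weyl_ok = all(lam[i] <= nu[i] + mu[0] + tol for i in range(N))

bound_83 = 0.5 * (np.sum(np.log(lam[M:])) + np.sum(np.log(mu_tilde))
                  - np.sum(np.log(mu)))
final_bound = 0.5 * np.sum(np.log(mu[0] / mu[M:]))

print(interlacing_ok, weyl_ok, bound_83, final_bound)
```

On such an instance the value of the intermediate bound does not exceed the final bound, which is the chain of inequalities the proof relies on.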